ABSTRACT


Data duplication removal uses a file checksum technique to identify
duplicate or redundant data rapidly and accurately. Inaccurate
results can be avoided by comparing the checksum of each already
existing file with that of the newly uploaded file. Each file is
stored with multiple attributes, such as file name, date and time,
checksum, and user ID. When a user uploads a new file, the system
generates the checksum of the file and compares it with the
checksums of the files that have already been stored. If a match is
found, the system updates the old entry; otherwise, a new entry is
created in the database.
1. INTRODUCTION

A collection of information is known as data. The
amount of data in the digital universe is increasing
constantly. One study suggested that by the end of 2020
each person would create about 1.7 megabytes of data
every second. It is also estimated that about 2.5
quintillion bytes of data are produced per day. The
reasons behind the growth of duplicate data include:

> Multiple backups of the same data or file by a single person.
> Misuse of social media.


Incidents such as the 9/11 attacks and data loss caused
by illegal activity such as hacking showed that loss of
data is a major problem for organizations. Such events
forced organizations to implement data backup systems in
order to preserve their important data. Organizations
started keeping regular backups of their data, such as
email, video, and audio, which increased their storage
requirements. While backing up data regularly, they end
up storing duplicate data multiple times, which wastes
storage.


As data increases constantly, storing and managing it
becomes more difficult. More data requires more storage,
and more storage costs more because hardware or storage
units must be added. Simply adding storage units is not a
solution, because it is not clear how many units will be
needed, and adding more units makes the system bulky and
more costly.


The solution to the above problem is the proper
implementation of a data duplication removal system.
The data duplication removal method stores data or
files in the system only if they have not been stored
previously. If a match is found, it updates the old
entry instead. The system therefore removes duplicate
data quickly and saves precious storage.
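
The store-or-update behaviour described above can be sketched as
follows. This is a minimal illustration, not the system's actual
implementation: the names FileEntry, DedupStore, and upload are
hypothetical, and an in-memory dictionary stands in for the
database.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class FileEntry:
    """One stored file record (attribute names are illustrative)."""
    file_name: str
    user_id: str
    checksum: str
    uploaded_at: datetime


class DedupStore:
    """Hypothetical in-memory stand-in for the database, keyed by checksum."""

    def __init__(self) -> None:
        self.entries = {}  # checksum -> FileEntry
        self.blobs = {}    # checksum -> file content, stored only once

    def upload(self, file_name: str, user_id: str, data: bytes) -> str:
        checksum = hashlib.md5(data).hexdigest()
        now = datetime.now(timezone.utc)
        entry = self.entries.get(checksum)
        if entry is not None:
            # Match found: update the old entry instead of storing a copy.
            entry.file_name = file_name
            entry.user_id = user_id
            entry.uploaded_at = now
            return "updated"
        # No match: store the content once and create a new entry.
        self.blobs[checksum] = data
        self.entries[checksum] = FileEntry(file_name, user_id, checksum, now)
        return "created"
```

With this sketch, uploading the same bytes twice under different
file names yields "created" and then "updated", and the content is
kept only once.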


2. SURVEY MOTIVATION 

"Di Pietro, Roberto, and Alessandro Sorniotti"
discussed the security concerns raised by
de-duplication. To address these concerns, the authors
use the idea of Proof of Ownership (POW). POWs are
intended to let a server verify whether a client
actually possesses a file.

According to "Atish Kathpal, Matthew John, and
Gaurav Makkar", data duplication removal is the
method of eliminating duplicate data from storage
devices in order to minimize the consumption of
storage space. Although the concepts were sound, their
system did not work as intended because of poor
management of hardware devices and poor usability,
which resulted in underperformance of the system.
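
The Proof of Ownership idea mentioned above can be illustrated with
a toy challenge-response exchange. This is a simplified sketch and
not the scheme proposed by the cited authors: the function names are
hypothetical, and it assumes the server holds a full copy of the
file and asks the client for the bytes at randomly chosen offsets,
so that knowing only the checksum is not enough to claim the file.

```python
import hmac
import secrets


def make_challenge(file_size: int, count: int = 16) -> list:
    """Server side: pick random byte offsets (file_size must be > 0)."""
    return [secrets.randbelow(file_size) for _ in range(count)]


def respond(data: bytes, offsets: list) -> bytes:
    """Client side: return the bytes found at the requested offsets."""
    return bytes(data[i] for i in offsets)


def verify(server_copy: bytes, offsets: list, response: bytes) -> bool:
    """Server side: recompute the expected answer and compare."""
    expected = respond(server_copy, offsets)
    return hmac.compare_digest(expected, response)
```

A client that really holds the file answers every challenge
correctly, while a client that holds different content is very
likely to fail once enough offsets are requested.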


2.1. GOAL 

Much work has been done in the past to solve the
storage problem caused by data duplication. Data
duplication has remained a major problem, and the
technology developed in the past was not able to solve
it because of improper management of the technology.


2.2. LIMITATION 

> More processing time. 

> Chance of false result. 

> Not user friendly. 

> System maintenance is difficult. 


2.3. KEYWORDS 
Cloud computing, data storage, file checksum 
algorithms, computational infrastructure, duplication. 


3. SURVEY OUTCOMES 

Data duplication increases the amount of unwanted
data in the storage unit by storing multiple copies of
the same file. The data duplication removal technique
uses file checksums to find duplicate or redundant data
quickly. It calculates the checksum of a file when the
file is uploaded and compares the newly calculated
checksum with the checksums of the files that are
already stored in the database. If the file is already
present, the existing entry is updated; otherwise, a new
entry is made for the file. In this system we use the
MD-5 hash algorithm to detect duplicate files. MD-5
(Message Digest algorithm 5) is a hash algorithm that
produces a 128-bit digest.
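
The checksum step can be sketched with Python's standard hashlib
module. This is a minimal illustration under assumptions: the helper
names md5_checksum and is_duplicate and the 64 KiB chunk size are
hypothetical choices, and a plain set stands in for the checksums
stored in the database.

```python
import hashlib
import io
from typing import BinaryIO


def md5_checksum(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Return the MD5 digest of a binary stream as 32 hex characters."""
    digest = hashlib.md5()
    # Read in fixed-size chunks so large uploads never sit fully in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(stream: BinaryIO, known_checksums: set) -> bool:
    """Check a new upload's checksum against those already stored."""
    return md5_checksum(stream) in known_checksums
```

For example, md5_checksum(io.BytesIO(b"abc")) returns
"900150983cd24fb0d6963f7d28e17f72". The 128-bit digest is 16 bytes,
which is why the hexadecimal form has 32 characters.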


Advantages: 

> Faster file searching. 

> Reduced storage space through elimination of data
redundancy.

> Ease of downloading and uploading files.


4. CONCLUSION 

This work focuses on developing a web-based
application that can find redundant data quickly and
easily using the file checksum technique. The Message
Digest (MD-5) algorithm is used to calculate the
checksums of both the already existing files and each
new file. The checksum also serves as an integrity
check on users' valuable files, although MD-5 is a hash
function and does not encrypt file contents. Hence,
this system removes duplicate files easily and quickly
while helping to verify the integrity of stored files.